Better data for decision-making through Bayesian imputation of suppressed provisional COVID-19 death counts

Purpose To facilitate use of timely, granular, and publicly available data on COVID-19 mortality, we provide a method for imputing suppressed COVID-19 death counts in the National Center for Health Statistic’s 2020 provisional mortality data by quarter, county, and age. Methods We used a Bayesian approach to impute suppressed COVID-19 death counts by quarter, county, and age in provisional data for 3,138 US counties. Our model accounts for multilevel data structures; numerous zero death counts among persons aged <50 years, rural counties, early quarters in 2020; highly right-skewed distributions; and different levels of data granularity (county, state or locality, and national levels). We compared three models with different prior assumptions of suppressed COVID-19 deaths, including noninformative priors (M1), the same weakly informative priors for all age groups (M2), and weakly informative priors that differ by age (M3) to impute the suppressed death counts. After the imputed suppressed counts were available, we assessed three prior assumptions at the national, state/locality, and county level, respectively. Finally, we compared US counties by two types of COVID-19 death rates, crude (CDR) and age-standardized death rates (ASDR), which can be estimated only through imputing suppressed death counts. Results Without imputation, the total COVID-19 death counts estimated from the raw data underestimated the reported national COVID-19 deaths by 18.60%. Using imputed data, we overestimated the national COVID-19 deaths by 3.57% (95% CI: 3.37%-3.80%) in model M1, 2.23% (95% CI: 2.04%-2.43%) in model M2, and 2.96% (95% CI: 2.76%-3.16%) in model M3 compared with the national report. The top 20 counties that were most affected by COVID-19 mortality were different between CDR and ASDR. Conclusions Bayesian imputation of suppressed county-level, age-specific COVID-19 deaths in US provisional data can improve county ASDR estimates and aid public health officials in identifying disparities in deaths from COVID-19.

Introduction structures and analyze suppressed data as though it were missing [12,18]. Bayesian methods have been applied to address problems with missing or suppressed data [19,20]. Prior investigators used the Poisson-gamma and conditional autoregressive models that account for spatial structure among counties to estimate the distribution of county-level, age-specific mortality rates in CDC WONDER data [11,12]. In addition, Bayesian methods can flexibly integrate information at different levels of granularity to infer the distributions of suppressed and unsuppressed data [13,21]. The 2020 NCHS county-level COVID-19 death data by quarter and age pose a unique statistical challenge because of the nature of the pandemic [4]. Previous studies addressed data suppression in statistics (e.g., mean, count) that were stable, continuously collected and reported [13][14][15][16]. In contrast, COVID-19 death counts increased substantially from quarter 1 to quarter 4 in many US communities during 2020 [1]. Therefore, the dataset contained numerous zero counts in early quarters and an increasing proportion of suppressed data over time. Data suppression in the provisional county-level COVID-19 death counts could hinder decision-making and statistical analyses that measure the impact of COVID-19 mortality among US counties. To account for the underlying distribution of suppressed and unsuppressed COVID-19 death counts by age and county-level attributes, we employed Bayesian methods to impute the suppressed death counts.
In this study, we provide a method for estimating suppressed COVID-19 death counts in the 2020 NCHS county-level provisional US mortality data to support timely analyses of COVID-19 mortality conducted by researchers who do not have access to complete age-specific provisional mortality data, including CDC researchers outside of NCHS and researchers in other government agencies and academia.

Overview
We developed a Bayesian model to impute suppressed COVID-19 deaths by quarter, county of residence, and age for provisional 2020 US death data. In our Bayesian model, all suppressed data are treated as parameters to be estimated. To improve the precision of our estimates, we used age-specific COVID-19 death counts at different levels of granularity to inform the likelihood, including quarterly data at the county level and annual data at the state and national levels. We evaluated prior knowledge of the plausible values of our estimates of the distribution of suppressed data, including a noninformative prior for all suppressed data (M1), the same weakly informative prior for all ages (M2), and different weakly informative priors for suppressed data by age <50 and �50 years (M3). We summarize the relationship between prior, data, likelihood, and posterior distributions in Fig 1. Prior distributions combined with the likelihood (the statistical model summarizing the hypothesized relationship between our measures and COVID-19 deaths) and data result in a posterior distribution that can be evaluated. We compare the three models using performance measures generated at different levels of data aggregation. We demonstrate the importance of age-specific COVID-19 deaths by county by comparing crude (CDR) and age-standardized death rates (ASDR). We used R 4.0.4 and the RStan package to interface between R and Stan, which is a probabilistic programming language for Bayesian inference [22][23][24]. The model code and datasets are available at https:// github.com/syzoekao/COVID19MortImpute. CDC ethics officials reviewed the study protocol to ensure activities were conducted in compliance with applicable federal law and CDC policy (see e.g., 45 C.F.R. part 46, 21 C.F.R. part 56; 42 U.S.C. §241(d); 5 U.S.C. §552a; 44 U.S.C. §3501 et seq).

County-level data
The 2020 provisional data on COVID-19 deaths by quarter, county, and age are available for public use at https://data.cdc.gov/NCHS/AH-Provisional-COVID-19-Deaths-by-Quarter-County-a/ ypxr-mz8e [4]. These provisional data are an ad-hoc file including data received and processed by NCHS through April 22, 2021. Although the dataset was later updated in July 2021, we used the dataset released in April 2021 in our analysis. Due to reporting lag, the provisional data released in April 2021 might underestimate the true number of COVID-19 deaths in 2020 [2,4].
In this provisional dataset, COVID-19 deaths were organized by the county of residence of decedents who died with COVID-19 confirmed or presumed as the underlying or contributing cause of death (ICD-10 code U07.01) [2]. The definition of county-county of residenceused in this provisional dataset differs from that of other NCHS datasets, in which county may be the county of occurrence [25]. Using the decedents' information (county of residence and age at death), COVID-19 death counts were tabulated by eight age groups (0-17, 18-29, 30-39, 40-49, 50-64, 65-74, 75-84, �85) and four quarters for 3,140 US counties in 50 states and the District of Columbia (DC). In this dataset, each row includes a county's COVID-19 death count for an age group and quarter along with the Federal Information Processing Standard (FIPS) county code, the state, and urban-rural code (S1 Table) [26]. Because of the small number of COVID-19 deaths at ages 0-17 years (199 deaths) in 2020 [2,11], we excluded this age group from the analysis. For persons aged �18 years, the dataset contains 87,920 rows for the COVID-19 death counts by age, quarter, and county (7 age groups x 4 quarters x 3,140 counties). In this dataset, data elements are suppressed (i.e., not available for analysis) if counts range from one to nine [4]. Only 46 counties have all data elements present for all age groups and quarters because those counties had zero COVID-19 deaths reported in 2020. Overall, 26.6% of data elements were suppressed, ranging from 6.2% in quarter 1 to 41.1% in quarter 4 (S1 Fig). This nearly 7-fold difference between quarters 1 and 4 was because of the increase in the number of counties that had at least one COVID-19 death in 2020.
To impute suppressed COVID-19 death counts for age groups �18 years, we considered factors associated with the pattern of both suppressed and unsuppressed COVID-19 deaths, including time (quarter), age group, and urban-rural code (S2 Fig). In addition, we considered agespecific population size at the county level to capture the high correlation between population size and COVID-19 deaths [27]. Age-specific population estimates by county were extracted from the 2015-2019 American Community Survey (ACS) 5-year estimates [28]. Two counties, Wade Hampton Census Area in Alaska (FIPS code: 2270) and Shannon County in South Dakota (FIPS code: 46113), had no matched population estimates from the 2015-2019 ACS, and therefore were removed from the imputation process. The final dataset included 3,138 US counties.
Aggregate-level data. We used the NCHS aggregate-level COVID-19 death information to inform the likelihood at the aggregate level (e.g., state-or locality-level, national level) in the Bayesian imputation model (Fig 1) (https://data.cdc.gov/NCHS/Provisional-COVID-19-Deaths-by-Sex-and-Age/9bhg-hcku) [3]. This dataset contains COVID-19 deaths by year, month, sex, and age at the national and state/locality levels, which include all 50 states, New York City, and DC. We used the age-specific COVID-19 death counts by year for both sexes at the state/locality and national levels in this dataset. All age-specific annual COVID-19 deaths were present at the national level, whereas some were suppressed at the state/locality level. To simplify the imputation task in this study and to enable others to reproduce our results, we developed a simple imputation method to impute these suppressed annual COVID-19 deaths at the state/locality level. The details of the simple imputation method for the state-/localitylevel data are described in the supplemental methods (S1 File, Supplemental methods 1).

Developing the Bayesian Model
County-level model. We modeled the distribution of COVID-19 deaths by age, quarter, and county using a Bayesian multilevel hurdle gamma model to infer the posterior distributions of the death counts. We made this modeling decision because of the highly right-skewed unsuppressed data with a large mass at zero deaths (gray bars in S3 Fig), and the nested data structure (age groups nested within quarters, quarters nested within counties). The hurdle gamma model estimates the probability of a county having zero or positive deaths using logistic regression [29] and uses the gamma distribution to approximate the highly right-skewed death data among counties reporting �1 death count [30,31]. We did not consider a Poisson distribution or negative binomial distribution to model count data because Stan has constraints on complex modeling for discrete parameters [32]. A multilevel model can account for the relationships between county-level factors and quarterly trends and the correlated data from counties within the same state [33,34]. The hurdle gamma model was specified as follows: shape � Gammað0:01; 0:01Þ ð3Þ Y aijt denotes the COVID-19 death count for age group a in county i located in state/locality j in quarter t; p aijt is defined as the probability that Y aijt is positive;Χ aijt and Χ aijt represent the design matrices for the parts of zero and positive death counts, respectively. Based on the gamma distribution, the mean (μ) is equal to shape rate aijt . The prior distribution of shape was presented in Eq (3). We used a multilevel logistic regression to model p aijt , and considered a loglink function to estimate the mean (μ) of the gamma distribution. The detailed specification of the logistic regression for p aijt and the log-link function for m are provided in the supplemental methods (S1 File, Supplemental methods 2).
We assumed that suppressed data elements (y miss ) followed the gamma distribution in the hurdle model setting. Because of computational constraints, we did not model mechanisms for suppressed data elements separately from unsuppressed ones, but we modeled these data with varying prior assumptions. We considered and compared three model priors, including noninformative priors characterized by a uniform distribution with support ranging from 0.6 to 9.4 (M1); weakly informative priors characterized by truncated normal distributions with mean 1, standard deviation 10, and the same range of support (M2); and weakly informative priors characterized by truncated normal distributions with mean 1, standard deviation differing by age group (5 for ages 18-49 years and 20 for ages �50 years), and the same range of support (M3). The range of support, 0.6-9.4, was used because posterior samples were rounded to the nearest integer after the samples were generated from the Bayesian model. Rounding the posterior samples allowed the samples to match the death counts within the suppressed data range, which are non-negative integers ranging from one to nine. ( For M2 and M3, weakly informative priors were centered at 1 because we hypothesized the distribution of the suppressed deaths followed a declining right-skewed distribution [31]. Unlike the other two models, M3 accounted for heterogeneity of COVID-19 deaths attributable to age [2,11]. Integrating aggregate-level information. Using only county-level data might lead to wide and unstable variation in the posterior distribution of suppressed data elements. To reduce the variation, we added aggregate-level information-age-specific annual COVID-19 death counts at the national level and state or locality level-to the likelihood [19]. These aggregate-level data, assumed to be independent, were approximated using normal distributions in Eqs (4) and (5).
Y aj and Y a represent the total annual death counts for age group a at the state/locality j and national levels, respectively.Ŷ aj andŶ a are predicted total annual death counts corresponding to Y aj and Y a , respectively. The parameters σ aj and σ a stand for the standard deviations for the state or locality-level and national-level death counts, respectively. Because these aggregate-level data were population counts rather than samples, we did not estimate the standard deviations using traditional epidemiological methods, which infer population parameters from samples. To allow for some degree of uncertainty in the normal approximation, the standard deviations (σ aj and σ a ) were constructed following a rule-based approach. If the national or state or local COVID-19 death count of an age group was more than 5, σ aj or σ a was set to 20% of the death count; if the death count was smaller than 5, σ aj or σ a was set to 1.
To estimateŶ aj andŶ a , we hypothesized that deaths were certified, verified, and counted in the same way that state-and national-level counts are obtained from county-level counts in the real world. We aggregated expected county-level death counts by age and quarter to produce age-specific annual death counts at the state or locality and national levels, respectively. For age group a in county i located in state or locality j in quarter t, the predicted death count (Ŷ aijt ) was a product of the predicted probability of positive death count (p aijt ) and the expected death count (m aijt ) if a positive death count was observed (Eq 6). The estimation ofp aijt andm aijt is described in the supplemental methods (S1 File, Supplemental methods 2). The predicted total annual death count for age group a in state or locality j is the sum of the predicted death counts across quarters and counties for age group a in state/locality j (Eq 7).Ŷ aijt ¼p aijt �m aijt ð6Þ The predicted annual death counts for age group a at the national level (Ŷ a ) was calculated in Eq (8)

Computer simulation
The posterior distribution of the Bayesian imputation model cannot be easily derived. We used Hamiltonian Monte Carlo (HMC) simulation in Stan, which is a probabilistic programming language for Bayesian inference, to simulate samples from the posterior distribution [22]. The HMC estimation process is described in the supplemental methods (S1 File, Supplemental methods 3). After obtaining the fitted model, we simulated COVID-19 deaths by age group, quarter, and county using the fitted model to perform posterior predictive checking, which checks whether the posterior distribution from the fitted model closely approximates the observed data [19,35]. To ease the computational burden and preserve the uncertainty informed from the Bayesian model, 1,000 sets of imputed suppressed data elements were sampled from the posterior distribution and rounded to integers. Suppressed data elements in the original dataset were replaced with a set of imputed data elements, resulting in 1,000 imputation datasets that combined both imputed suppressed data elements and unsuppressed data elements from the original dataset. We conducted simulations for each prior assumption model (M1-M3). To ensure model convergence, the simulation consisted of 3 chains and 4,000 iterations containing 1,000 warmups for each model.

Model performance, comparison, and validation
We assessed model performance at the national, state or locality, and data element levels for each fitted model. At the national level, we estimated the percent bias, which measures the percent relative difference of the estimated death counts to the reported death counts by age group and for overall national death counts [36]. At the state/locality level, we estimated the root mean squared error (RMSE) that measures the difference between the predicted and reported counts at the state/locality level [37,38]. The model performance measures at the national and state/locality levels were generated using the 1,000 datasets by aggregating death counts by county to the corresponding aggregate level. At the data element level, we used the loo package to estimate expected log predictive density through leave-one-out cross-validation (elpd_loo), which measures how well the model performs on new data points [39,40]. These model performance measures were calculated for each model and were used to compare the model fit among all three models. Details of how the model performance and comparison were conducted are described in the supplemental methods (S1 File, Supplemental methods 4).

Estimating crude and age-standardized death rates
To show the importance of age-specific COVID-19 deaths in the provisional data, we estimated different types of death rates at the county level in two scenarios. In scenario 1, we assumed that the provisional data were not available and estimated county-level CDR using cumulative daily COVID-19 death counts by county published by the publicly available data source, USAFacts, on December 31, 2020 [9]. In scenario 2, with the provisional data available, we estimated county-level ASDR [41] using the imputation results and unsuppressed data. We investigated the correlation between county-level CDR or ASDR and county social vulnerability, which is measured by the 2018 CDC/ATSDR Social Vulnerability Index (SVI) for all US counties [42][43][44]. County SVI is a composite indicator that measure different dimensions of county-level emergency preparedness for disastrous events (e.g., minority status, income level, age composition) [42,43]. The relation between county SVI and the COVID-19 pandemic has been studied to understand the vulnerability of US communities to the pandemic [7,45]. We hypothesized that county SVI will be more highly correlated with county-level ASDR than county-level CDR. Furthermore, we listed the top 20 counties that were most affected by COVID-19 mortality in 2020 in descending order of county-level CDR and ASDR, respectively.

Model performance and comparison
Model convergence and the posterior predictive results are reported in the supplemental results (S2 File, Supplemental results 1). In 2020, national provisional data included 384,375 provisional deaths with COVID-19 as the underlying or contributing cause. The number of deaths reported by age groups are shown in Table 1 (column 1). Using county-level provisional data without imputation, we underestimated the official total number of COVID-19 deaths among US residents aged �18 years (N = 384,180) by 18.55%, and consistently underestimated COVID-19 deaths across all age groups, ranging from 12.60% to 76.96%. Using county-level provisional data with imputation, the estimated total number of COVID-19 deaths was 397,910 in M1, 392,767 in M2, and 395,552 in M3. All models overestimated the total number of COVID-19 deaths in the nation by 3.57% (95% CI: 3.37%, 3.80%) for M1, 2.23% (95% CI: 2.05%, 2.43%) for M2, and 2.96% (95% CI: 2.76%, 3.16%) for M3. When comparing death counts by age group across the three models, M1 underestimated death counts the least for persons aged 18-29 years but overestimated death counts the most for the other age groups. For persons aged 40-49 years, in which M1 overestimated deaths by 8.23% (95% CI: 6.70%, 9.84%), M2 and M3 were able to reduce the percent of overestimation to 6.20% (95% CI: 4.70%, 7.74%) and 1.04% (95% CI: -0.41%, 2.40%), respectively. Comparing death counts between M2 and M3, M3 performed better than M2 for those aged 30-49 years, whereas M2 performed better than M3 for other age groups. At the state or locality level, annual deaths by age group were highly correlated between the reported provisional data and the aggregated data from the imputation results for all three models (ρ = 0.9990 in S4 and S5 Figs). Using RMSE (panel [A] in S2 Table), M2 had the best performance at the state or locality level with the lowest RMSE (mean = 68.03; SD = 1.86). At the county level, we assessed model performance with unsuppressed data elements using Pareto k diagnostic values (panel [B] in S2 Table). All models provided estimates consistent with the data for all unsuppressed data elements except for 18 records for M1, 21 records for M2, and 16 records for M3. Most records that the fitted models did not predict well were among older age groups (75-84 years and �85 years), quarters 3 and 4, and noncore counties. Based on elpd_loo, M1 performed slightly better than M2 and M3 in providing estimates consistent with unsuppressed data elements. While the three models had slight differences in model performance, the distribution of county-level COVID-19 deaths generated from the posterior distribution were similar across models (S3 Fig).

Interpretation and application of imputation results
We present the imputation results from M1 as an example to illustrate the interpretation of the results. The proportion of counties by the positive number of COVID-19 deaths (1, 2, . . ., �20 deaths), age group, and quarter calculated from M1 is shown in Fig 2. The distributions incorporating predicted and unsuppressed COVID-19 deaths generally follow a smooth rightskewed shape regardless of quarter or age group. These distributions suggest that among counties reporting positive COVID-19 deaths, the number of COVID-19 deaths that most counties reported was likely to fall within the ranges of suppressed data. Among those aged <65 years, most counties reported one death in all quarters. For those aged �65 years, the number of deaths reported by most counties increased from one death in quarter 1 to two or three deaths in quarters 2-4. The correlation between county SVI and county CDR for COVID-19 in scenario 1 was 0.19, lower than the correlation between county SVI and county ASDR for COVID-19 in scenario 2 (~0.32 for all models) (S6 Fig). In Tables 2 and S3, we presented the top 20 counties most affected by COVID-19 mortality in 2020 by county-level CDR in scenario 1, and ASDR in scenario 2 using the imputation results from M1, M2, and M3. The top 20 counties differed by type of death rate. There were only two counties (Hamlin County, SD and Buffalo County, SD) selected by both CDR and ASDR. Regardless of the type of death rates, most of the top 20 counties were noncore counties or counties with relatively small population size. The top 20 counties selected based on the CDR in scenario 1 had a lower average SVI (40.02) than the average SVI among the top 20 counties selected based on the ASDR in scenario 2 (74.99 for M1 and M3, 77.01 for M2). In addition, we calculated the average  county-level CDR and ASDR among all the counties within each US census division to show how geographic distribution of COVID-19 mortality burden might differ between CDR and ASDR ( Table 3). All 9 US census divisions were ranked by the mean county-level CDR and ASDR, respectively. Compared between CDR and ASDR, the rankings were different among five divisions. Notably, the ranking of East South Central changed from 4 based on CDR to 2 using ASDR.  Table 3.

Mean and interquartile ranges (IQRs) of county-level crude and age-standardized COVID-19 deaths by US census division.
USAFacts Imputation using M1

Discussion
This study illustrates a flexible Bayesian approach to impute suppressed COVID-19 death counts by age, county, and quarter in provisional data. This approach might be useful for facilitating research activities and policy analyses for COVID-19 mortality among public health researchers such as CDC researchers outside of NCHS and researchers in other government agencies and academia. The Bayesian imputation models used different prior assumptions about the unknown distributions of suppressed death counts; integrated provisional death counts at county, state or locality, and national levels; and accounted for excessive zero death counts [13][14][15][19][20][21]. Our study showed that this approach can yield valid and consistent estimates for suppressed data under different prior assumptions. The provisional data combined with the imputed suppressed data elements could be used in further analyses to inform intervention policies to slow the spread of COVID-19.
The Bayesian approach provides the benefit of incorporating prior assumptions into the estimation process [19,35,46]. Research has shown that prior assumptions can influence posterior estimates [20,[46][47][48]. In this study, we explored different prior assumptions for suppressed data and demonstrated model comparison at each level of data granularity. We cannot say which model produces the most valid estimates of suppressed values for several reasons. First, the three prior assumptions selected are for illustrative purposes only, because they are only a subset of all possible prior distributions. Researchers can expand upon our approach by considering more complex prior distributions or incorporating known information about suppressed death counts in choosing prior distributions. Second, the choice of prior distributions is subjective and can vary by analyst. With three levels of data aggregation to choose from, researchers can select the best fitted model based on the most appropriate model performance measures for each aggregate level. In addition, while noninformative priors performed the best at the data element level (elp-d_loo results), it tends to overfit the (unsuppressed) data [19,20,46,48]. If researchers are concerned with the quality of unsuppressed data (e.g., measurement error, systematic data collection issues), weakly informative priors might be a better choice in the estimation process [20,[46][47][48].
In our study, we demonstrated why the imputation results can help inform public health decision-making during a pandemic via the estimation of CDR and ASDR at the county level. Without age-specific COVID-19 deaths, CDR is the primary metric used to understand disparities in deaths from COVID-19 among US counties. However, CDR is influenced by the population's age composition and might be misleading when comparing populations [41]; for COVID-19, the size of the elderly population is the primary driver of death in many communities. ASDR, which removes the influence of confounding age composition in the population, is the alternative. In our analysis, we found that CDR and ASDR resulted in very different rankings of counties most affected by COVID-19 mortality. If death rate is used as the metric for resource allocation to reduce disparity in death from COVID-19, ASDR is a better metric than CDR because CDR might bias resource allocation toward counties with a larger elderly population. In addition, we found that for COVID-19, ASDR was more highly correlated with SVI than CDR at the county level. This suggests that ASDR better reflects factors not directly correlated with age (e.g., poverty, crowded housing). Furthermore, public health researchers can use the imputation results that reduced the bias from suppressed provisional data to investigate the impact of policies such as non-pharmaceutical interventions and vaccine uptake on COVID-19 deaths in a timely manner. As the pandemic continues, this imputation method can be applied to the provisional death data in 2021 and 2022 to facilitate analysis for public health policies in a rapidly changing pandemic.
Our study is subject to several limitations. First, we did not use a discrete distribution for count data due to the computational constraint of Stan [32]. Nonetheless, the gamma distribution can be used to model a right-skewed distribution such as COVID-19 mortality. For future studies, researchers can change the model specifications to distributions that are more appropriate for count data by adapting from the modeling process provided in this study. Second, although variables such as health insurance coverage, income level, and minority status might improve the model fit, we did not include them to reduce computational burden. The model presented in this study cost about two days' worth of computational effort to fit and test; additional variables could make the model more computationally intensive. Third, despite outperforming estimates with no imputation, our models generally overestimated all age groups except for ages 18-29 years. This overestimation issue could be addressed by adding more detailed information (e.g., state-or locality-level quarterly data by age group) [48,49]. In the process of model building, we noticed that as age-specific annual COVID-19 deaths at the state/locality level were added to the likelihood, overestimation was substantially reduced. However, detailed data are more likely to suffer from data suppression. We conducted analyses among different prior distributions only for suppressed data. However, more informative prior distributions could be used in the estimation of fixed and random effects in the model. Fourth, although the structure of our multilevel model accounted for the correlation among the counties within a state, this model did not account for the spatial relationship among adjacent counties [14,15,50]. Future studies can explore models such as conditional autoregressive models, to capture the correlated mortality rate through spatial relationships. Finally, results from this study are only applicable to the release of the provisional COVID-19 mortality data. As the final 2020 mortality data are now available, researchers can apply to access restricted-use data that do not have data suppression through CDC WONDER [51,52]. Nonetheless, this method is still applicable to the nonrestricted-use final COVID-19 mortality data publicly available through CDC WONDER because the nonrestricted-use data still contain data suppression. In addition, this method can be applicable to address the suppressed data in future provisional data (e.g., 2021, 2022), mortality for causes beyond COVID-19, and other types of publicly available datasets.

Conclusion
Data suppression might limit the analytical use of publicly available datasets, especially when the proportion of data suppressed is high. We demonstrated a flexible Bayesian approach that can model the data generating processes and integrate different levels of data granularity to impute suppressed COVID-19 death counts by age group, quarter, and county for 2020. This imputation approach can help uncover age-specific COVID-19 deaths in the provisional data and facilitate further analyses, such as estimating COVID-19 ASDR by county. Compared with CDR, use of ASDR that can only be estimated from the imputation results might lead to different resource allocation decisions intended to reduce disparity in deaths from COVID-19 among US communities.